Abstract

Term grouping and thesaurus methods have frequently been incorporated 
into automatic content analysis programs as devices for the recognition of 
synonymous expressions and of linguistic entities that may be semantically 
similar but syntactically distinct. While it has frequently been asserted
that the recognition of synonyms is essential in language analysis, actual
proofs of the usefulness of a thesaurus in automatic information retrieval
are still outstanding.

In the present study, formal proofs are given of the effectiveness 
under well-defined conditions of the thesaurus method in information retrieval.
It is shown, in particular, that when certain semantically related terms are 
added to the information queries originally submitted by the user population, 
a superior retrieval system is obtained in the sense that for every level of 
the recall the retrieval precision is at least as good for the altered queries 
as for the original ones.

1. Introduction

A good deal is known about the representation of document content and the 
assignment of effective content identifiers (index terms, keywords, descriptors)
to documents and information requests. Among the characteristics of good
content identifiers the following are now widely agreed upon by experts 
in the field: 

a) Good content bearing words tend to occur in the documents of a 
collection with uneven frequency distributions; that is, in certain
documents their occurrence frequencies are much larger than would
be expected from a random assignment of terms to documents;
nonspecialty words, on the other hand, exhibit random occurrence
patterns in the documents of a collection. 

b) The most effective content identifiers exhibit little redundancy 
with other terms also used for content identification; in particular, 
terms with high document frequency — those assigned to a large 
proportion of the documents of a collection — tend to be 
indiscriminate in their retrieval capability and lead to losses in 
retrieval precision.*

c) Effective content identifiers are expected to break up large clusters 
of documents that are not otherwise distinguishable for retrieval 
purposes; that is, they should reduce the existing uncertainty for 
the given document set. Thus, terms that occur with excessively
low document frequency in the documents of a collection are not
optimal and lead to unacceptable losses in recall.



* The effectiveness of a retrieval system is often evaluated by two
complementary measures known as precision and recall, defined
respectively as the proportion of retrieved items that are relevant, and
the proportion of relevant items that are retrieved. In general, an
effective retrieval system exhibits high values for both recall and
precision in that the user expects to retrieve a reasonable
proportion of what is relevant while at the same time rejecting a high
proportion of what is extraneous.

These considerations have given rise to a variety of automatic
indexing strategies designed to assign appropriate content identifiers
to the documents of a collection and to incoming user queries. One such
is the discrimination value method which has been used with a variety of 
document collections in different subject areas. [1,2] The best discriminators 
are invariably found to be terms of average document frequency — they occur 
normally in more than one one-hundredth of the documents of a collection, but 
less than one tenth of the collection. In the discrimination value model the 
good discriminators are assigned as content identifiers to documents and queries 
without any modifying transformation. 

Terms whose document frequency is either too high or too low often lead 
to unacceptable losses in precision and recall, respectively, and must be 
transformed into better terms by an appropriate reduction (or increase) in 
their document frequencies. Two types of frequency transformations are 
therefore introduced: 

a) a decreasing frequency transformation applicable to the high frequency 
terms which by combining such terms into term phrases produces content 
identifiers of lower document frequency that are more specific than
the original phrase components; 

b) an increasing frequency transformation applicable to the low frequency 
terms which assembles such terms into classes of similar or related 
terms; by assigning such term or thesaurus classes as content 
identifiers, higher frequency, more general entities are produced than 
the original class entries. 

The main role assigned to the thesaurus by the discrimination value model 
is then as a device for assembling low frequency terms into classes in the hope 
of creating more general content identifiers that lead to improvements in the
recall performance.

2. The Thesaurus Method 

Before embarking on the mathematical development, it may be useful 
briefly to outline the proof procedures and the assumptions leading to the
results. 

Query and document vectors are assumed to be binary, that is, $d_i$ [$q_i$]
equals 1 whenever term $i$ is present in document $D$ [query $Q$], and is zero
otherwise. The similarity function $s$ between queries and documents is
assumed to be

$$ s(D, Q) = \sum_{i=1}^{n} d_i q_i $$

where $n$ is the vector length (the number of distinct terms in the vectors).
For binary vectors, $s$ represents the number of matching terms between the
query and document vectors.
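As a minimal sketch of this matching function (the function name and the sample vectors are illustrative assumptions, not part of the study), the inner product of two binary vectors can be computed as follows:

```python
def similarity(doc, query):
    """Inner product of two binary term vectors: the number of shared terms."""
    if len(doc) != len(query):
        raise ValueError("document and query vectors must have the same length n")
    return sum(d * q for d, q in zip(doc, query))


# Hypothetical vocabulary of n = 5 terms.
doc = [1, 0, 1, 1, 0]
query = [1, 1, 1, 0, 0]
print(similarity(doc, query))  # terms 1 and 3 occur in both vectors
```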

The evaluation of the effectiveness of a particular method of term
assignment is based on the comparison of the retrieval precision at given
levels of the recall. Consider a specified recall level $y$ (a specified
proportion of relevant items retrieved), and let $|R|$ be the total number of
relevant items for a given query. Then the precision at recall level $y$
may be defined as

$$ P_y = \frac{y\,|R|}{\text{total number of items to be retrieved in order to obtain } y\,|R| \text{ relevant ones}} $$

A retrieval system (A) is then assumed to be superior to an alternative system
(B) if and only if for all recall levels $y$, the retrieval precision for (A)
is at least as large as that for (B).
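The definition can be sketched in code as follows. The function names and the two sample rankings are hypothetical, and the comparison checks only the recall levels of the form $k/|R|$, which is an assumption made to keep the example finite:

```python
import math


def precision_at_recall(ranked_relevance, total_relevant, recall_level):
    """Precision after retrieving just enough items to reach the recall level.

    ranked_relevance: booleans in rank order (True = relevant item).
    total_relevant:   |R|, the number of relevant items for the query.
    recall_level:     y, with 0 < y <= 1.
    """
    # y|R| relevant items are needed; round() guards against float noise.
    needed = math.ceil(round(recall_level * total_relevant, 9))
    if needed < 1:
        raise ValueError("recall level must be greater than zero")
    found = 0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        found += int(is_relevant)
        if found >= needed:
            return needed / rank
    raise ValueError("recall level cannot be reached with this ranking")


def is_superior(ranking_a, ranking_b, total_relevant):
    """True if A is at least as precise as B at every recall level k/|R|."""
    levels = [k / total_relevant for k in range(1, total_relevant + 1)]
    return all(
        precision_at_recall(ranking_a, total_relevant, y)
        >= precision_at_recall(ranking_b, total_relevant, y)
        for y in levels
    )


# Two hypothetical rankings of the same 7 items, 4 of which are relevant.
system_a = [True, False, True, False, False, True, True]
system_b = [False, True, False, True, False, True, True]
print(precision_at_recall(system_a, 4, 0.5))  # 2 relevant found by rank 3
print(is_superior(system_a, system_b, 4))
```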
The computation of $P_y$ makes it necessary to identify the number of
nonrelevant documents that must be retrieved for each increase of 1 in the
number of relevant documents obtained. This in turn requires the following
assumptions to be made regarding the occurrences of terms in the documents
of the collection and the composition of the relevant and nonrelevant
document sets for each query:

Assumption 1: For each query, the corresponding query terms are
assumed to be independently assigned to the documents of the collection.
Furthermore the terms are assumed to be uniformly distributed across the
set of relevant documents $R$ and the set of nonrelevant documents $I$. That
is, the probability of occurrence of a given term $j_k$ has the same value
for all relevant documents in $R$; similarly the value is the same for all
nonrelevant documents in $I$ (although the two probabilities may differ
from each other).

Thus, if one assumes that the probability of a relevant [nonrelevant]
document containing term $j_k$ is $r_{j_k}/|R|$ [$a_{j_k}/|I|$], where
$r_{j_k}$ and $a_{j_k}$ are the number of relevant and nonrelevant documents,
respectively, containing term $j_k$, then the probability that a given
relevant [nonrelevant] document contains a given term set
$(j_1, j_2, \ldots, j_p)$ will be

$$ \prod_{k=1}^{p} \frac{r_{j_k}}{|R|} \qquad \Big[ \prod_{k=1}^{p} \frac{a_{j_k}}{|I|} \Big]. $$

Assumption 2: All documents exhibiting a given number of matching
query-document terms have equal chance of being retrieved. That is, if
$c$ ($c > 1$) relevant items and $g$ nonrelevant items all exhibit the same
similarity coefficient with respect to some query $Q$, then it is assumed
that $g/c$ nonrelevant items are retrieved for each relevant item retrieved.
That is, the relevant items occur at even intervals among the nonrelevant
ones in the ranked list of retrieved documents (the ranking is assumed to be
in decreasing order of the query-document similarity).

The intention of the thesaurus method is to create from each original
query $Q$ a new query $Q^*$ obtained by adding to $Q$ one or more terms that
are "semantically related" to the original terms. More specifically, for
each term $q$ included in $Q$, a set of related terms is defined as the set
of all terms included in the same thesaurus class as $q$. All such related
terms are then added to $Q$ to form $Q^*$. Each of the new terms
$\{j_1, j_2, \ldots, j_\ell\}$ added to the query $Q = \{1, 2, \ldots, m\}$ is
weighted by a factor $\lambda/\ell$, where $\lambda < 1$.

This means that the increment in the similarity between $Q^*$ and $D$ due to the
added terms will be strictly less than 1.
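The expansion step can be sketched as follows. The thesaurus classes, the term names, the function names and the value of `lam` are all hypothetical, and the sketch assumes $\lambda > 0$ in addition to the stated $\lambda < 1$:

```python
def expand_query(query_terms, thesaurus_classes, lam=0.75):
    """Return {term: weight} for the augmented query.

    Original terms keep weight 1. The l added terms share a total weight
    lam < 1, so each one receives lam / l.
    """
    if not 0 < lam < 1:
        raise ValueError("lam must satisfy 0 < lam < 1")
    original = set(query_terms)
    added = set()
    for term_class in thesaurus_classes:
        members = set(term_class)
        if original & members:  # the class contains at least one query term
            added |= members - original
    weights = {term: 1.0 for term in original}
    for term in added:
        weights[term] = lam / len(added)
    return weights


def weighted_similarity(doc_terms, weights):
    """Sum of the weights of the query terms that occur in the document."""
    return sum(w for term, w in weights.items() if term in doc_terms)


# Hypothetical thesaurus with three classes.
classes = [{"car", "automobile", "vehicle"}, {"engine", "motor"}, {"tree", "shrub"}]
weights = expand_query({"car", "engine"}, classes, lam=0.75)
print(weighted_similarity({"car", "automobile", "motor"}, weights))
```

With three added terms each receives weight 0.25, so a document matching only the added terms scores 0.75, which is below the score of any document matching a single original term.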

One additional restriction applies to the terms supplied by the thesaurus,
motivated by its role as a classification of low frequency, specific terms.
The thesaurus terms must be "high precision" terms, that is, their probability
of occurrence in the documents relevant to a given query must not be smaller
than their probability of occurrence in the nonrelevant items. More precisely,
for each term $j_k$ included in the thesaurus and for each query $Q$

$$ \frac{r_{j_k}}{|R|} \ge \frac{a_{j_k}}{|I|}. $$

There is considerable evidence that "term precision" as defined here is inversely
related to document frequency, and that for the low-frequency terms included
in a thesaurus, this requirement is satisfied in most cases. [3]

The main theorem may now be stated as follows: the thesaurus method
providing for the addition to the original user queries of semantically
related terms taken from a thesaurus produces a superior retrieval system.

The relevant proof appears in the next section.

3. Thesaurus Effectiveness 

The main proof makes use of a technical lemma which may be stated as
follows: consider a function of $\ell$ terms $S_1, S_2, \ldots, S_\ell$
($0 \le S_i \le 1$, $i = 1, 2, \ldots, \ell$) consisting of sums of products of
$\ell$ terms each, each product containing $j$ factors ($j \le \ell$) chosen
from the set of $S_i$ and $\ell - j$ terms ($\ell - j \ge 0$) consisting of
factors $(1 - S_i)$. Specifically, let

$$ c(S_1, S_2, \ldots, S_\ell;\, j) = \sum_{p} \Big[ \prod_{k=1}^{j} S_{p(k)} \Big] \Big[ \prod_{k=j+1}^{\ell} \big( 1 - S_{p(k)} \big) \Big] $$

where $p$ denotes a permutation of $\{1, 2, \ldots, \ell\}$ and the summation
covers all the $\binom{\ell}{j}$ combinations of $j$ terms out of $\ell$.*

Then, if $S_g \ge S_g'$, one has for any lower summation limit $t$

$$ \sum_{j=t}^{\ell} c(S_1, \ldots, S_{g-1}, S_g, S_{g+1}, \ldots, S_\ell;\, j) \ge \sum_{j=t}^{\ell} c(S_1, \ldots, S_{g-1}, S_g', S_{g+1}, \ldots, S_\ell;\, j). $$

The proof appears in the appendix.

* For example
$c(S_1, S_2, S_3, S_4;\, 3) = S_1 S_2 S_3 (1 - S_4) + S_1 S_2 S_4 (1 - S_3) + S_1 S_3 S_4 (1 - S_2) + S_2 S_3 S_4 (1 - S_1)$.
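The function $c$ and the inequality of the lemma can be explored numerically. The sketch below is an illustration only: the helper names `c` and `tail_sum` and the sample values are assumptions, and a numerical check on one example is not a substitute for the proof. When the $S_i$ are read as probabilities of independent events, $c(S_1, \ldots, S_\ell; j)$ is the probability that exactly $j$ of the events occur.

```python
from itertools import combinations
from math import prod


def c(s, j):
    """Sum, over all j-subsets T, of prod(S_k for k in T) * prod(1 - S_k otherwise)."""
    n = len(s)
    total = 0.0
    for chosen in combinations(range(n), j):
        chosen = set(chosen)
        total += prod(s[k] if k in chosen else 1 - s[k] for k in range(n))
    return total


def tail_sum(s, t):
    """Sum of c(s, j) for j = t, ..., len(s), as used in the lemma."""
    return sum(c(s, j) for j in range(t, len(s) + 1))


# The footnote example with all four S_i equal to 0.5: four products of 0.5**4.
print(c([0.5, 0.5, 0.5, 0.5], 3))

# Lemma check on hypothetical values: raising S_2 from 0.3 to 0.7 cannot
# decrease the tail sum.
print(tail_sum([0.2, 0.7, 0.4], 2) >= tail_sum([0.2, 0.3, 0.4], 2))
```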
The main theorem is true if for every recall point y the thesaurus
method provides a retrieval precision which is not inferior to that 
obtainable with the standard (nonthesaurus) process. In a retrieval 
situation in which the retrieved documents are presented to the user in 
decreasing order of the corresponding query-document similarity coefficient, 
a recall, and hence a precision, value may be calculated following the 
retrieval of each individual document (that is, after retrieval of the first 
item; after the second item; after the third item, and so on, down to the 
last retrieved document). 

Among all the recall values obtained in this way, some are of special
interest, corresponding to the retrieval of the last document within each
set of documents exhibiting a common number of matching query-document terms
and including a relevant item; these special recall levels are known as standard
recall points. A typical example showing a ranked list of retrieved documents is
shown together with its standard recall points in Table 1.

The main theorem will be proved first for the standard recall points, 
and later for any nonstandard recall level situated between adjacent 
standard recall points. 
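The definition of standard recall points can be sketched as follows. The ranked list below is hypothetical and does not reproduce Table 1; the function name is also an assumption:

```python
from itertools import groupby


def standard_recall_points(ranked, total_relevant):
    """Return (similarity, recall, precision) at each standard recall point.

    ranked: list of (similarity, is_relevant) pairs, already sorted in
            decreasing order of similarity.
    A point is recorded after the last document of every tie group that
    contains at least one relevant item.
    """
    points = []
    retrieved = 0
    relevant_found = 0
    for sim, group in groupby(ranked, key=lambda item: item[0]):
        flags = [is_relevant for _, is_relevant in group]
        retrieved += len(flags)
        relevant_found += sum(flags)
        if any(flags):
            points.append(
                (sim, relevant_found / total_relevant, relevant_found / retrieved)
            )
    return points


# Hypothetical ranking of 8 documents, 4 of which are relevant.
ranked = [
    (3, True), (3, False),
    (2, True), (2, True), (2, False),
    (1, False), (1, False),
    (0, True),
]
for point in standard_recall_points(ranked, total_relevant=4):
    print(point)
```

The group with similarity 1 contains no relevant item, so it contributes retrieved documents but no standard recall point.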

Theorem: The thesaurus method provides a superior retrieval system.

Proof: Consider the situation first for all standard recall points
$q_i$ for which all documents with query-document similarity greater than or
equal to $i$ are retrieved by the original query $Q$.

Any document $D_j$ not retrieved by $Q$ at recall point $q_i$ has similarity
equal at most to $(i-1) + \lambda$ with $Q^*$, where $(i-1) + \lambda < i$. Such
a document is then not retrieved by $Q^*$ at standard recall point $q_i$. On the
other hand any relevant document $D_k$ retrieved by $Q$ at $q_i$ will
necessarily exhibit a similarity coefficient with $Q^*$ at least equal to $i$.
Thus all relevant documents retrieved by $Q$ at $q_i$ are also retrievable by
$Q^*$; at the same time nonrelevant items not retrieved by $Q$ at $q_i$ are also
rejected by $Q^*$.

Consider now an arbitrary nonstandard recall point $x$ situated between
$q_i$ and the preceding standard recall point $q_{i+1}$. The documents retrieved
at recall point $x$ fall into two classes:

i) those whose similarity with $Q$ is at least equal to $i+1$,
and ii) those whose similarity with $Q$ is exactly equal to $i$.

Let $B'$ and $B''$ be the number of relevant and nonrelevant documents of type
(i) respectively. Analogously, let $X'$ and $X''$ be the number of relevant and
nonrelevant items of type (ii). If $p$, $0 < p \le X'$, is the number of relevant
documents of type (ii) retrieved by $Q$ at recall point $x$, then the total
number of retrieved documents (both relevant and nonrelevant) of type (ii) will
be $(p/X') \cdot (X' + X'')$, since by Assumption 2 all documents with a given
number of query-document term matches are assumed to be retrievable equally
easily. The precision for $Q$ at recall point $x$ is then

$$ \frac{B' + p}{B' + B'' + \dfrac{p}{X'} (X' + X'')}. \tag{1} $$

Two types of documents also exist for query $Q^*$ at recall point $x$,
namely

iii) those whose similarity with $Q^*$ is $i + 1$ or larger,
and iv) those whose similarity with $Q^*$ is at least equal to $i$ but
less than $i+1$.

For $Q^*$, however, the documents of type (iv) are further subdivided into
$\ell + 1$ subclasses, including those whose similarity coefficient with $Q^*$
equals $i + \lambda$, $i + \frac{\ell - 1}{\ell}\lambda$, ..., $i + \frac{\lambda}{\ell}$, $i$.
Let the number of relevant (nonrelevant) documents in the $\ell + 1$ different
subclasses be $X_1'$ ($X_1''$), $X_2'$ ($X_2''$), ..., $X_{\ell+1}'$ ($X_{\ell+1}''$),
respectively, with $X_1'$ corresponding to similarity coefficient
$i + \lambda$, and $X_{\ell+1}'$ to similarity $i$.

Obviously, the $X_1'$ relevant documents exhibit $i$ matches with terms
originally included in $Q$ and $\ell$ matches with the added terms
$\{j_1, j_2, \ldots, j_\ell\}$. The same is true for the $X_1''$ nonrelevant
documents. The remaining document classes exhibit correspondingly fewer matches
with the added terms.

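Since by Assumption 1 the distribution of query terms is assumed uniform
across all relevant documents, and terms are independently assigned, it is clear
that

$$ X_1' = X' \cdot \prod_{k=1}^{\ell} \frac{r_{j_k}}{|R|} = X' \cdot \frac{r_{j_1}}{|R|} \cdot \frac{r_{j_2}}{|R|} \cdots \frac{r_{j_\ell}}{|R|} $$

where $r_{j_k}/|R|$ is the probability that a relevant document contains term $j_k$.

Similarly, $X_2', \ldots, X_{\ell+1}'$ will be equal respectively to

$$ X' \cdot c\Big( \frac{r_{j_1}}{|R|}, \ldots, \frac{r_{j_\ell}}{|R|};\, \ell - 1 \Big), \; \ldots, \; X' \cdot c\Big( \frac{r_{j_1}}{|R|}, \ldots, \frac{r_{j_\ell}}{|R|};\, 0 \Big). $$

The nonrelevant counts $X_k''$ are obtained in the same way from $X''$ and the
probabilities $a_{j_k}/|I|$.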

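Without loss of generality consider $p$, the number of relevant documents
retrieved by $Q$ at recall point $x$ with $i$ matching terms, such that

$$ \sum_{k=1}^{y+1} X_k' > p \ge \sum_{k=1}^{y} X_k' $$

for some integer $y$, $0 \le y \le \ell + 1$. The proof is given first for
$p = \sum_{k=1}^{y} X_k'$. For such a value of $p$, the precision value for
$Q^*$ will be

$$ \frac{B' + p}{B' + B'' + \sum_{k=1}^{y} (X_k' + X_k'')}. \tag{2} $$

By comparing the denominators of (1) and (2), the result follows provided that

$$ \frac{p}{X'} (X' + X'') \ge \sum_{k=1}^{y} (X_k' + X_k''). $$

Since $p = \sum_{k=1}^{y} X_k'$, this implies

$$ \frac{X' + X''}{X'} \ge \frac{\sum_{k=1}^{y} (X_k' + X_k'')}{\sum_{k=1}^{y} X_k'} $$

or again

$$ \frac{X'}{X''} \le \frac{\sum_{k=1}^{y} X_k'}{\sum_{k=1}^{y} X_k''}. \tag{3} $$

The summations can be replaced as shown earlier as follows

$$ \frac{X'}{X''} \le \frac{X' \cdot \sum_{k=\ell-y+1}^{\ell} c\big( \frac{r_{j_1}}{|R|}, \ldots, \frac{r_{j_\ell}}{|R|};\, k \big)}{X'' \cdot \sum_{k=\ell-y+1}^{\ell} c\big( \frac{a_{j_1}}{|I|}, \ldots, \frac{a_{j_\ell}}{|I|};\, k \big)}. \tag{4} $$

Expression (4) is obviously true provided the sum in the numerator is at least
as large as that of the denominator, that is, provided

$$ \sum_{k=\ell-y+1}^{\ell} c\Big( \frac{r_{j_1}}{|R|}, \ldots, \frac{r_{j_\ell}}{|R|};\, k \Big) \ge \sum_{k=\ell-y+1}^{\ell} c\Big( \frac{a_{j_1}}{|I|}, \ldots, \frac{a_{j_\ell}}{|I|};\, k \Big). $$

But, by the term precision assumption, $r_{j_k}/|R| \ge a_{j_k}/|I|$, $1 \le k \le \ell$.

Thus a repeated application of the results of the lemma establishes
the result.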

Consider now $\sum_{k=1}^{y+1} X_k' > p > \sum_{k=1}^{y} X_k'$. The precision of
the augmented query $Q^*$ at recall point $x$ equal to $(B' + p)/|R|$ will be

$$ \frac{B' + p}{B' + B'' + \sum_{k=1}^{y} (X_k' + X_k'') + \dfrac{p - \sum_{k=1}^{y} X_k'}{X_{y+1}'} \, (X_{y+1}' + X_{y+1}'')}. \tag{5} $$

The denominator of (5) includes all retrieved documents exhibiting at least
$i + 1$ original term matches with $Q^*$ (that is, $B' + B''$), followed by the
documents with $i$ original term matches and at least $\ell - y + 1$ matches
through the added terms $\{j_1, j_2, \ldots, j_\ell\}$. The right-most term in
the denominator of (5) covers a subset of the documents exhibiting $i$ original
term matches and $\ell - y$ matches through the added terms.

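By comparing (1) and (5), it is seen that the performance of $Q^*$ at
recall point $x$ will be at least as good as that of the original query $Q$ if
and only if

$$ \frac{p}{X'} (X' + X'') \ge \sum_{k=1}^{y} (X_k' + X_k'') + \frac{p - \sum_{k=1}^{y} X_k'}{X_{y+1}'} \, (X_{y+1}' + X_{y+1}''). $$

This is equivalent to

$$ p \cdot \frac{X''}{X'} \ge p \cdot \frac{X_{y+1}''}{X_{y+1}'} + \sum_{k=1}^{y} X_k'' - \sum_{k=1}^{y} X_k' \cdot \frac{X_{y+1}''}{X_{y+1}'}. $$

By adding $\sum_{k=1}^{y} X_k' \cdot \frac{X''}{X'} - \sum_{k=1}^{y} X_k' \cdot \frac{X''}{X'}$
to the previous expression, one obtains

$$ p \Big( \frac{X''}{X'} - \frac{X_{y+1}''}{X_{y+1}'} \Big) - \sum_{k=1}^{y} X_k' \Big( \frac{X''}{X'} - \frac{X_{y+1}''}{X_{y+1}'} \Big) + \sum_{k=1}^{y} \Big( X_k' \cdot \frac{X''}{X'} - X_k'' \Big) \ge 0 $$

or finally

$$ \Big( p - \sum_{k=1}^{y} X_k' \Big) \Big( \frac{X''}{X'} - \frac{X_{y+1}''}{X_{y+1}'} \Big) + \sum_{k=1}^{y} \Big( X_k' \cdot \frac{X''}{X'} - X_k'' \Big) \ge 0. \tag{6} $$

Since $\frac{X'}{X''} \le \frac{\sum_{k=1}^{y} X_k'}{\sum_{k=1}^{y} X_k''}$ by
equation (3), the second term of (6) is obviously greater than or equal to 0.
The first factor of the left-hand term in (6) is greater than zero since
$p > \sum_{k=1}^{y} X_k'$ for the case under consideration. Thus if
$X''/X' \ge X_{y+1}''/X_{y+1}'$ the theorem is established. If on the other hand
$X''/X' < X_{y+1}''/X_{y+1}'$, the first term of (6) becomes negative since the
two factors of the product have opposite sign. By substituting in (6) a larger
value of $p$ than that for the current case (for example,
$\sum_{k=1}^{y+1} X_k'$), a new expression is obtained which is necessarily
smaller than (6):

$$ \Big( \sum_{k=1}^{y+1} X_k' - \sum_{k=1}^{y} X_k' \Big) \Big( \frac{X''}{X'} - \frac{X_{y+1}''}{X_{y+1}'} \Big) + \sum_{k=1}^{y} \Big( X_k' \cdot \frac{X''}{X'} - X_k'' \Big). \tag{7} $$

But expression (7) covers the previously treated case where
$p = \sum_{k=1}^{y} X_k'$ for some integer $y$; for that case the theorem has
already been proved. Indeed, (7) simplifies to
$\sum_{k=1}^{y+1} \big( X_k' \cdot \frac{X''}{X'} - X_k'' \big)$, which is
nonnegative by (3) applied with $y + 1$ in place of $y$.
Thus (7) is reducible to (3) and the proof is complete.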
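The comparison between equation (1) and equations (2) and (5) can be illustrated numerically. The sketch below is not part of the proof: all counts and probabilities are hypothetical, the subclass sizes are the expected (fractional) values implied by Assumption 1, and every subclass is assumed to contain some relevant documents (probabilities strictly between 0 and 1).

```python
from itertools import combinations
from math import prod


def c(s, j):
    """Sum over all j-subsets of prod(S_k chosen) * prod(1 - S_k not chosen)."""
    n = len(s)
    return sum(
        prod(s[k] if k in chosen else 1 - s[k] for k in range(n))
        for chosen in map(set, combinations(range(n), j))
    )


def subclass_counts(total, probs):
    """Expected sizes of the l+1 subclasses, most added-term matches first."""
    l = len(probs)
    return [total * c(probs, l - k) for k in range(l + 1)]


def precision_original(b_rel, b_non, x_rel, x_non, p):
    """Equation (1): ties at similarity i are resolved by Assumption 2."""
    return (b_rel + p) / (b_rel + b_non + (p / x_rel) * (x_rel + x_non))


def precision_expanded(b_rel, b_non, xs_rel, xs_non, p):
    """Equations (2) and (5): descend through the subclasses of the new query."""
    retrieved = b_rel + b_non
    remaining = p
    for x_rel_k, x_non_k in zip(xs_rel, xs_non):
        if remaining <= 0:
            break
        take = min(remaining, x_rel_k)  # relevant items taken from this subclass
        retrieved += (take / x_rel_k) * (x_rel_k + x_non_k)
        remaining -= take
    return (b_rel + p) / retrieved


# Hypothetical setting: two added terms that satisfy the term precision condition.
B_REL, B_NON = 5, 5      # B' and B''
X_REL, X_NON = 20, 80    # X' and X''
r = [0.6, 0.5]           # r_jk / |R|
a = [0.2, 0.1]           # a_jk / |I|, each no larger than the matching r value

xs_rel = subclass_counts(X_REL, r)
xs_non = subclass_counts(X_NON, a)
never_worse = all(
    precision_expanded(B_REL, B_NON, xs_rel, xs_non, p)
    >= precision_original(B_REL, B_NON, X_REL, X_NON, p) - 1e-12
    for p in range(1, X_REL + 1)
)
print(never_worse)
```

The small tolerance in the comparison only absorbs floating-point rounding at the point where both queries have retrieved every document with similarity at least $i$.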

The proof procedure given here for the thesaurus method is usable under 
somewhat different assumptions and conditions for other retrieval techniques 
including term weighting and phrase transformations. [3,4]

Table 1. Standard Recall Points
(assumption: total number of relevant items is 10)

Appendix

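Lemma: If $S_g \ge S_g'$ it follows that

$$ \sum_{j=t}^{\ell} c(S_1, \ldots, S_{g-1}, S_g, S_{g+1}, \ldots, S_\ell;\, j) \ge \sum_{j=t}^{\ell} c(S_1, \ldots, S_{g-1}, S_g', S_{g+1}, \ldots, S_\ell;\, j). $$

Proof:

$$ c(S_1, \ldots, S_\ell;\, j) = S_g \big[ c(S_1, \ldots, S_{g-1}, S_{g+1}, \ldots, S_\ell;\, j-1) - c(S_1, \ldots, S_{g-1}, S_{g+1}, \ldots, S_\ell;\, j) \big] + c(S_1, \ldots, S_{g-1}, S_{g+1}, \ldots, S_\ell;\, j). $$

Thus

$$ \sum_{j=t}^{\ell} c(S_1, \ldots, S_{g-1}, S_g, S_{g+1}, \ldots, S_\ell;\, j) = S_g \sum_{j=t}^{\ell} \big[ c(S_1, \ldots, S_{g-1}, S_{g+1}, \ldots, S_\ell;\, j-1) - c(S_1, \ldots, S_{g-1}, S_{g+1}, \ldots, S_\ell;\, j) \big] + \sum_{j=t}^{\ell} c(S_1, \ldots, S_{g-1}, S_{g+1}, \ldots, S_\ell;\, j). $$

The factor $c(S_1, \ldots, S_{g-1}, S_{g+1}, \ldots, S_\ell;\, \ell)$ can be defined
as zero, because one cannot factor out $\ell$ terms when only $\ell - 1$ are
present. Furthermore, all but the first term appearing in the square brackets
cancel, that is

$$ \sum_{j=t}^{\ell} \big[ c(S_1, \ldots, S_{g-1}, S_{g+1}, \ldots, S_\ell;\, j-1) - c(S_1, \ldots, S_{g-1}, S_{g+1}, \ldots, S_\ell;\, j) \big] $$
$$ = c(\ldots;\, t-1) - c(\ldots;\, t) + c(\ldots;\, t) - c(\ldots;\, t+1) + \cdots $$
$$ = c(S_1, \ldots, S_{g-1}, S_{g+1}, \ldots, S_\ell;\, t-1). $$

Thus

$$ \sum_{j=t}^{\ell} c(S_1, \ldots, S_{g-1}, S_g, S_{g+1}, \ldots, S_\ell;\, j) = S_g \cdot c(S_1, \ldots, S_{g-1}, S_{g+1}, \ldots, S_\ell;\, t-1) + \sum_{j=t}^{\ell-1} c(S_1, \ldots, S_{g-1}, S_{g+1}, \ldots, S_\ell;\, j). \tag{4} $$

The lemma is an immediate consequence of the last expression, since the
coefficient of $S_g$ is nonnegative.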
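As a worked instance of the last expression, take $\ell = 3$, $g = 2$ and $t = 2$ (values chosen here only for illustration). The reduction then reads:

```latex
\begin{align*}
\sum_{j=2}^{3} c(S_1, S_2, S_3;\, j)
  &= S_2 \cdot c(S_1, S_3;\, 1) + c(S_1, S_3;\, 2) \\
  &= S_2 \,\bigl[ S_1 (1 - S_3) + S_3 (1 - S_1) \bigr] + S_1 S_3 .
\end{align*}
```

The bracketed coefficient of $S_2$ is nonnegative whenever $0 \le S_1, S_3 \le 1$, so the sum cannot decrease when $S_2$ is replaced by a larger value.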
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